November 1997 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Full text available: pdf(273.52 KB) Additional Information: full citation, abstract, references, citings

The Starfire interconnect extends the envelope of Unix symmetric multiprocessor (SMP) 
systems in several dimensions. Interconnect: an active centerplane with four address 
routers and a 16x16 data crossbar provides 64 UltraSPARC processors with uniform 
memory access at a bandwidth of 10,667 MBps. Flexibility: Starfire can be dynamically
reconfigured into multiple hardware-protected operating system domains. Robustness:
Failing boards can be hot swapped without interrupting sy ...

Chris Rowen, Steve Leibson

June 2004 Proceedings of the 41st annual conference on Design automation 

Full text available: pdf(96.22 KB) Additional Information: full citation, abstract, index terms

This paper focuses on a particular SOC design technology and methodology, here called the 
advanced or processor-centric SOC design method, which reduces the risk of SOC design 
and increases ROI by using configurable processors to implement on-chip functions while 
increasing the SOC's flexibility through software programmability. The essential enabler for
this design methodology is automatic processor generation: the rapid and easy creation of
new microprocessor architectures, complete with effici ... 

Full text available: pdf(357.61 KB) Additional Information: full citation, abstract, references, citings, index terms

This paper gives an overview of the BlueGene/L Supercomputer. This is a jointly funded 
research partnership between IBM and the Lawrence Livermore National Laboratory as part 
of the United States Department of Energy ASCI Advanced Architecture Research Program. 
Application performance and scaling studies have recently been initiated with partners at a 
number of academic and government institutions, including the San Diego Supercomputer 
Center and the California Institute of Technology. This mass ... 

Yu-Kwong Kwok, Ishfaq Ahmad

December 1999 ACM Computing Surveys (CSUR), Volume 31 Issue 4 

Full text available: pdf(723.58 KB) Additional Information: full citation, abstract, references, citings, index terms

Static scheduling of a program represented by a directed task graph on a multiprocessor 
system to minimize the program completion time is a well-known problem in parallel 
processing. Since finding an optimal schedule is an NP-complete problem in general, 
researchers have resorted to devising efficient heuristics. A plethora of heuristics have been 
proposed based on a wide spectrum of techniques, including branch-and-bound,
integer-programming, searching, graph-theory, randomization, genetic ...

Keywords: DAG, automatic parallelization, multiprocessors, parallel processing, software 
tools, static scheduling, task graphs 
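
To make the problem statement in the abstract above concrete, here is a minimal Python sketch of list scheduling, one common family of static scheduling heuristics for task graphs. It is an illustration only and is not claimed to be any specific algorithm from the survey. The function name, the longest-task-first priority rule, the assumption of identical processors, and the assumption of zero communication cost between tasks are all choices made for this sketch.

```python
def list_schedule(durations, preds, num_procs):
    """Greedy list scheduling of a task DAG (illustrative sketch).

    durations: dict mapping task name -> run time
    preds:     dict mapping task name -> set of predecessor task names
    num_procs: number of identical processors (assumed homogeneous)

    Assumes zero communication cost between tasks.
    Returns (schedule, makespan), where schedule maps
    task -> (processor index, start time, finish time).
    """
    finish = {}                    # task -> finish time
    proc_free = [0] * num_procs    # time at which each processor becomes idle
    schedule = {}
    pending = set(durations)

    while pending:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [t for t in pending
                 if all(p in finish for p in preds.get(t, ()))]
        if not ready:
            raise ValueError("task graph contains a cycle")

        # Hypothetical priority rule: longest task first, ties broken by name.
        task = min(ready, key=lambda t: (-durations[t], t))

        # Place the task on the processor that becomes idle first.
        proc = min(range(num_procs), key=lambda i: proc_free[i])
        data_ready = max((finish[p] for p in preds.get(task, ())), default=0)
        start = max(proc_free[proc], data_ready)

        finish[task] = start + durations[task]
        proc_free[proc] = finish[task]
        schedule[task] = (proc, start, finish[task])
        pending.remove(task)

    return schedule, max(finish.values(), default=0)
```

A heuristic of this kind runs in polynomial time but does not guarantee a minimum completion time, which is consistent with the abstract's remark that finding an optimal schedule is NP-complete in general.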



11 Multiprocessor software design 
Peter Hibbard 

January 1980 Proceedings of the ACM 1980 annual conference 

Full text available: pdf(669.64 KB) Additional Information: full citation, abstract, references, index terms

Machines intended for parallel computations exhibit a wide variety of architectural designs, 
including pipeline, vector and array organizations, less traditional associative, data-flow and 
systolic organizations, and shared-memory MIMD organizations. It is not surprising, 
therefore, that the software support for these machines exhibits a wide variety of features 
reflecting the differing designs. Even within a single class of parallel machine, the system 
software used on different machines w ... 

12 Multiprocessor Organization — a Survey 
Philip Enslow 

January 1977 ACM Computing Surveys (CSUR), Volume 9 issue 1 

Full text available: pdf(1.79 MB) Additional Information: full citation, references, citings, index terms



13 Disco: running commodity operating systems on scalable multiprocessors 
Edouard Bugnion, Scott Devine, Kinshuk Govil, Mendel Rosenblum 
November 1997 ACM Transactions on Computer Systems (TOCS), volume 15 issue 4 

Full text available: pdf(400.76 KB) Additional Information: full citation, abstract, references, citings, index terms, review

In this article we examine the problem of extending modern operating systems to run 
efficiently on large-scale shared-memory multiprocessors without a large implementation 
effort. Our approach brings back an idea popular in the 1970s: virtual machine monitors. 
We use virtual machines to run multiple commodity operating systems on a scalable 
multiprocessor. This solution addresses many of the challenges facing the system software 
for these machines. We demonstrate our approach with a prototy ... 



Keywords: scalable multiprocessors, virtual machines 

Todd C. Mowry

February 1998 ACM Transactions on Computer Systems (TOCS), volume 16 issue 1 

Full text available: pdf(410.70 KB) Additional Information: full citation, abstract, references, citings, index terms, review

The large latency of memory accesses in large-scale shared-memory multiprocessors is a 
key obstacle to achieving high processor utilization. Software-controlled prefetching is a 
technique for tolerating memory latency by explicitly executing instructions to move data 
close to the processor before the data are actually needed. To minimize the burden on the 
programmer, compiler support is needed to automatically insert prefetch instructions into 
the code. A key challenge when ... 



Keywords: compiler optimization, prefetching 



15 Effective cache prefetching on bus-based multiprocessors 
Dean M. Tullsen, Susan J. Eggers 

February 1995 ACM Transactions on Computer Systems (TOCS), volume 13 issue 1

Full text available: pdf(2.30 MB) Additional Information: full citation, abstract, references, citings, index terms

Compiler-directed cache prefetching has the potential to hide much of the high memory 
latency seen by current and future high-performance processors. However, prefetching is 
not without costs, particularly on a shared-memory multiprocessor. Prefetching can 
negatively affect bus utilization, overall cache miss rates, memory latencies and data 
sharing. We simulate the effects of a compiler-directed prefetching algorithm, running on a 
range of bus-based multiprocessors. We show that, despite a ... 

Keywords: bus-based multiprocessors, cache prefetching, false sharing, memory latency 
hiding 



16 Cache coherence in large-scale shared-memory multiprocessors: issues and 
David J. Lilja

17 Algorithms for scalable synchronization on shared-memory multiprocessors
John M. Mellor-Crummey, Michael L. Scott 

February 1991 ACM Transactions on Computer Systems (TOCS), Volume 9 issue 1

Full text available: pdf(3.07 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in 
shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting 
tend to produce large amounts of memory and interconnect contention, introducing 
performance bottlenecks that become markedly more pronounced as applications scale. We 
argue that this problem is not fundamental, and that one can in fact construct busy-wait 
synchronization algorithms that induce no memory or interc ... 



18 Architectural primitives for a scalable shared memory multiprocessor 
Joonwon Lee, Umakishore Ramachandran 

19 Waiting algorithms for synchronization in large-scale multiprocessors
August 1993 ACM Transactions on Computer Systems (TOCS), volume 11 issue 3

Full text available: pdf(2.72 MB) Additional Information: full citation, abstract, references, citings, index terms

Through analysis and experiments, this paper investigates two-phase waiting algorithms to 
minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two- 
phase algorithm, a thread first waits by polling a synchronization variable. If the cost of 
polling reaches a limit Lpoll and further waiting is necessary, the thread is blocked, 
incurring an additional fixed cost, B. The choice of Lpoll 

Keywords: barriers, blocking, competitive analysis, locks, producer-consumer 
synchronization, spinning, waiting time 
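
As a rough illustration of the two-phase waiting scheme described in the abstract above, the following Python sketch models the cost a thread pays for a given wait. It is a simplified cost model, not the paper's analysis. The function names, the assumption that polling costs one unit per unit of waiting time, the assumption that no further cost accrues once a thread is blocked, and the comparison against an offline optimum are all choices made for this sketch.

```python
def two_phase_cost(wait_time, l_poll, block_cost):
    """Cost of one wait under a two-phase policy (illustrative model).

    The thread polls until the accumulated polling cost reaches l_poll
    (assumed here: one unit of cost per unit of time). If the wait is
    still not over, the thread blocks and pays the fixed cost block_cost.
    """
    if wait_time <= l_poll:
        return wait_time              # wait ended while still polling
    return l_poll + block_cost        # polled up to the limit, then blocked


def offline_optimal_cost(wait_time, block_cost):
    """Cost with perfect foreknowledge of wait_time (hypothetical baseline):
    poll for the whole wait, or block immediately, whichever is cheaper."""
    return min(wait_time, block_cost)


if __name__ == "__main__":
    B = 5
    for t in (1, 5, 6, 100):
        online = two_phase_cost(t, l_poll=B, block_cost=B)
        best = offline_optimal_cost(t, B)
        print(t, online, best, online / best)
```

Under these modelling assumptions, setting the polling limit equal to the blocking cost means the two-phase policy never pays more than twice what the offline baseline pays: short waits cost the same as pure polling, and long waits cost at most 2B against an optimum of B.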



20 Multiprocessor hardware: An architectural overview 
John Tartar 

January 1980 Proceedings of the ACM 1980 annual conference 

Full text available: pdf(797.11 KB) Additional Information: full citation, abstract, references, index terms

The subject of multiprocessor computer systems has been discussed almost since the 
inception of the modern digital computer in its uniprocessor form. The motivation for 
multiprocessor system research and development activity arises from a consideration of one 
or more of the following factors: throughput, flexibility, extendability, price/performance,
availability, reliability, fault tolerance. While any one of ...